Genome Medicine — Latest Matching Preprints

1

OmicsPred as a centralised resource for genetic prediction of multi-omic traits

Foguet, C.; Gil, L.; Xu, Y.; Salazar-Magana, S.; Ritchie, S. C.; Persyn, E.; Im, H. K.; Inouye, M.; Lambert, S. A.

2026-05-19 genetic and genomic medicine 10.64898/2026.05.15.26353298 medRxiv

Top 0.1%

39.4%

Genetic prediction of multi-omic data has emerged as a cost-effective alternative to direct omics profiling, particularly useful for identifying molecular features associated with disease susceptibility. However, despite their popularity, multi-omic imputation models are fragmented across studies, hindering findability, accessibility, interoperability and re-use. To address this, we developed OmicsPred (https://www.omicspred.org), a centralised platform for the deposition and dissemination of genetic prediction models of multi-omic traits. OmicsPred unifies the most commonly used molecular imputation models (e.g. from PredictDB) and other published studies totalling 3,339,469 prediction models spanning transcriptomic, proteomic, and metabolomic traits (as of May 2026). Each model is accompanied by metadata describing score development and predictive performance, and distributed in formats compatible with popular analytic tools, such as PGS Catalog Calculator and MetaXcan. To demonstrate the utility of the resource for systematic target discovery, we perform a multi-omic phenome-wide association analysis in Million Veteran Program data.

2

Automated GenePy Gene-Burden Computation via a Reproducible Nextflow Workflow Integrated with the Genomics England (GEL) Lifebit Platform

Nazari, I.; Ennis, S.; Ashton, J.; Cheng, G.

2026-05-24 genetic and genomic medicine 10.64898/2026.05.22.26353863 medRxiv

Top 0.1%

34.2%

Interpretation of rare-disease genomes remains constrained by variant-centric analytical frameworks that insufficiently capture the cumulative impact of multiple variants within a gene. GenePy provides an individual-level, gene-based burden metric that integrates variant consequence, allele frequency, and zygosity into a unified quantitative score, enabling a transition from discrete variant annotation to aggregated gene-level interpretation. In the context of Genomics England, this formulation supports a panel-agnostic, genotype-to-phenotype diagnostic strategy for unresolved monogenic disorders by prioritising genes with elevated mutational burden per individual. Here, we present a fully automated, containerised GenePy workflow deployed through Nextflow and integrated within the Genomics England (GEL) Research Environment via the Lifebit CloudOS platform. This implementation provides scalable, secure, and governance-compliant computation of gene-level burden scores across population-scale cohorts. The workflow harmonises variant annotation, quality control, and chunked data aggregation within modular, reproducible processes designed for high-throughput execution on cloud-native infrastructure. By enabling robust, portable, and auditable gene-level scoring across large rare-disease sequencing datasets, this framework enhances analytical resolution and supports downstream statistical prioritisation, integrative phenotype matching, and hypothesis generation within genotype-to-phenotype diagnostic workflows.
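
The idea of folding variant consequence, allele frequency, and zygosity into a single per-gene number can be illustrated with a toy calculation. The sketch below is not taken from the abstract: the scoring form (a consequence weight multiplied by the negative log10 of the product of the two carried alleles' frequencies, summed over a gene's variants) is an assumption made here for illustration, and the `Variant` type, its field names, and the example values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    deleteriousness: float  # hypothetical consequence weight in [0, 1]
    alt_freq: float         # population frequency of the alternate allele, 0 < f < 1
    alt_copies: int         # 1 = heterozygous, 2 = homozygous alternate


def variant_burden(v: Variant) -> float:
    """Assumed per-variant score: rarer alleles and two copies score higher."""
    ref_freq = 1.0 - v.alt_freq
    if v.alt_copies == 2:
        allele_product = v.alt_freq * v.alt_freq
    elif v.alt_copies == 1:
        allele_product = v.alt_freq * ref_freq
    else:
        return 0.0  # no alternate allele carried, so no contribution
    return -v.deleteriousness * math.log10(allele_product)


def gene_burden(variants):
    """Sum the per-variant scores within each gene for one individual."""
    scores = {}
    for v in variants:
        scores[v.gene] = scores.get(v.gene, 0.0) + variant_burden(v)
    return scores


# Hypothetical genotype calls for a single individual.
calls = [
    Variant("GENE_A", 1.0, 0.01, 2),   # rare, damaging, homozygous
    Variant("GENE_A", 0.5, 0.20, 1),   # common, milder, heterozygous
    Variant("GENE_B", 0.9, 0.001, 1),  # very rare, damaging, heterozygous
]
burden = gene_burden(calls)
for gene, score in sorted(burden.items()):
    print(gene, round(score, 3))
```

Ranking genes by such a per-individual score is one way to prioritise candidates without committing to a fixed gene panel, which is the use the abstract describes.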

3

Unbiased Long-Read Whole-Genome Sequencing Enables High-Resolution Mapping of Transgene Concatenation and Off-target Genomic Disruption in a Mouse Model

Mehta, M.; Ahmed, K.; Hussein, R.; Tavares, E.; Berberovic, Z.; Adele, R.; D'Souza, A.; Gu, B.; Wilson, M. D.; Ivakine, E.; Monnier, P. P.; Heon, E.; Vincent, A.

2026-05-18 genomics 10.64898/2026.05.15.725597 medRxiv

Top 0.1%

29.2%

Transgenic mouse models are indispensable for dissecting disease mechanisms; yet, their interpretability is frequently compromised by cryptic genomic alterations introduced during transgenesis. Thus, robust quality control strategies are needed to elucidate integration architecture and evaluate model performance when such unintended events occur. Here, we applied unbiased whole-genome long-read sequencing using the PacBio Revio to investigate a mouse model exhibiting unexpected transgene silencing, originally designed to recapitulate autosomal-dominant hereditary macular dystrophy driven by upregulation of a ZZEF1-ALOX15 fusion gene. Long-read sequencing analysis revealed a ≥29-kb head-to-tail concatemer containing more than three copies of the transgene vector. Reconstruction of transgene-genome junctions revealed off-target integration of the concatemer into the calcium-sensing receptor gene (Casr), along with exogenous E. coli DNA, that together defined final transgene architecture. 5-methylcytosine profiling identified hypermethylation of the transgene promoter and additional phenotyping indicated disruption of endogenous Casr function resulting from the rearrangement. Our workflow enabled direct detection of transgene concatenation and off-target mapping. These findings establish long-read sequencing as a powerful and scalable quality control standard for genetically engineered animal models, uniquely capable of uncovering hidden genomic complexity, resolving aberrant phenotypes, and enhancing the reliability of in vivo disease modelling.

4

Genetic profiling of soft tissue and bone tumors using SarcDBase

Difilippo, V.; Saba, K. H.; Wallander, K.; Styring, E.; Nathrath, M.; Baumoer, D.; Haglund de Flon, F.; Nord, K. H.

2026-05-15 genomics 10.64898/2026.05.13.724790 medRxiv

Top 0.1%

28.6%

To streamline molecular profiling of tumor biopsies, we developed SarcDBase, an openly accessible tool that extracts and interprets clinically relevant genetic alterations from next-generation sequencing data. By automatically linking identified variants to curated, user-defined reference lists, SarcDBase minimizes the need for specialized expertise and reduces the burden of manual data processing. The platform delivers detailed molecular profiles, diagnostic insights and an intuitive interface for comprehensive interpretation. SarcDBase's performance was evaluated in a heterogeneous cohort of 204 deep-sequenced bone and soft tissue tumors. In most cases (81%), its interpretation closely matched the curated post-sequencing diagnosis. Discrepancies mainly occurred in samples lacking diagnostically informative mutations. In some instances, SarcDBase flagged rare or unexpected alterations, including previously unreported gene fusions. This highlights SarcDBase's dual potential as both an interrogative research tool and a facilitator of molecular diagnostics, especially for reclassification of diagnostically challenging tumor types.

5

Generalist large language models complement tailor-made predictors for tumor genomics interpretation

Yu, J.; Darmofal, M.; Waters, M.; Choy, J.; Tran, T. N.; Fu, C.; Morales, L.; U, K.; Levine, R. L.; Schultz, N.; Berger, M. F.; Morris, Q.; Jee, J.

2026-05-22 genomics 10.64898/2026.05.21.726957 medRxiv

Top 0.1%

28.3%

General-purpose large language models (LLMs) are trained on large corpora to acquire broad knowledge, but whether LLMs can replace, or augment, task-specific models is unclear. We evaluated LLMs on three real-world, clinically important tumor genomic interpretation tasks, in order of increasing difficulty: (i) distinguishing tumor from non-tumor mutations (n=34,415 variants), (ii) distinguishing driver from passenger mutations (n=13,469 variants), and (iii) inferring cancer type from tumor sequencing reports across multiple assays and institutions (n=102,791 samples). The best general-purpose LLMs performed as well as the benchmark tailor-made predictor for task (i). Ensembling tailor-made models with zero-shot LLMs improved their performance for tasks (i) and (ii). For task (iii), LLMs outperformed or supplemented tailor-made models on out-of-distribution data. Without fine-tuning, current LLMs already can be useful in clinical genomic interpretation by adding complementary expertise to tailor-made, state-of-the-art predictors.

6

A harmonized single-cell RNA-seq atlas of human localized and metastatic prostate cancers and benign tissues

Cho, H.; Zhang, Y.; Zhou, J.; Daggar, A.; Kang, S.; Mannan, R.; Cao, X.; Dhanasekaran, S. M.; Chinnaiyan, A. M.

2026-05-20 cancer biology 10.64898/2026.05.18.725966 medRxiv

Top 0.1%

23.5%

Single-cell RNA sequencing (scRNA-seq) effectively captures the differences in transcriptomic landscape of cell types and cell states between benign and cancer tissues. Pooling publicly available datasets distributed across independent studies enables increased sample representation and cross-study comparisons. Here we present a harmonized scRNA-seq atlas of the human prostate constructed by integrating 17 available studies, comprising 163 samples from 106 donors. The dataset contains benign tissue, primary tumors, and metastatic disease profiles. Raw sequencing FASTQ data files were uniformly reprocessed to minimize technical variability. Study metadata were curated and standardized using a unified schema capturing donor identity, tissue site, disease context, and histologic grade. Post quality control, the integrated dataset contains 754,000 high-quality cells. Harmonized cell type annotations were generated using a pseudobulk correlation framework informed by multiple reference resources. The workflow identified 17 distinct cell types representing epithelial, mesenchymal, and immune compartments of the prostate. The processed expression matrices, standardized metadata, and analysis workflows are publicly available to support reproducible analysis and enable exploration of heterogeneity across prostate disease states.
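
A pseudobulk correlation approach to cell type annotation can be sketched in a few lines. The abstract does not give the details of the authors' framework, so everything below is an assumption for illustration: cells in a cluster are averaged into one profile, that profile is compared to each reference profile with Pearson correlation, and the best-matching label is assigned. The three-gene panel and all expression values are hypothetical toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (assumes neither is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pseudobulk(cells):
    """Average per-gene expression over all cells in a cluster."""
    n_genes = len(cells[0])
    return [sum(cell[g] for cell in cells) / len(cells) for g in range(n_genes)]


def annotate(cluster_cells, references):
    """Return the reference label whose profile correlates best with the cluster."""
    profile = pseudobulk(cluster_cells)
    return max(references, key=lambda name: pearson(profile, references[name]))


# Hypothetical reference profiles over a toy gene panel: KRT8, VIM, PTPRC.
references = {
    "epithelial": [9.0, 1.0, 0.0],
    "mesenchymal": [1.0, 9.0, 0.0],
    "immune": [0.0, 1.0, 9.0],
}
cluster = [[8.0, 2.0, 0.0], [10.0, 0.0, 1.0]]  # two toy cells from one cluster
label = annotate(cluster, references)
print(label)
```

Working on cluster averages rather than single cells reduces the effect of dropout noise, which is one common reason for choosing a pseudobulk comparison.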

7

Linear plasmid prevalence and linezolid resistance gene carriage in vancomycin-resistant Enterococcus in Canada from 2009-2024

Lerminiaux, N.; McCracken, M.; Bartoszko, J. J.; Grewal, G.; Ahmed, S.; Johnstone, J.; Golding, G. R.; CNISP VRE working group

2026-05-12 genetic and genomic medicine 10.64898/2026.05.08.26352429 medRxiv

Top 0.1%

23.0%

The incidence of vancomycin-resistant Enterococcus (VRE) is rising in hospitals in Canada, and resistance to last-resort antimicrobials including linezolid complicates treatment options for multidrug-resistant isolates. Recent reports from around the globe indicate that both linezolid and vancomycin resistance genes can be co-carried and mobilized by linear plasmids (named pELF) in Enterococcus species, often on the same backbone. We aimed to investigate linezolid resistance and linear plasmid prevalence in VRE bloodstream infection isolates collected by the Canadian Nosocomial Infection Surveillance Program from 2009 to 2024. We found that screening for pELF linear plasmid ends in short reads was a reliable way to predict linear plasmid presence in large-scale surveillance data (100 % accuracy on 85 reference samples). Almost half of the isolates in our collection were predicted to carry pELF plasmids (45.4 %, 941/2071) and we found that this proportion has increased from 2018 (32.2 %, 59/183) to 72 % of isolates between 2021 and 2024 (2021: 68.5 % (115/168); 2022: 71.6 % (146/204); 2023: 72.8 % (166/228); 2024: 71.6 % (235/328)). This trend of increasing linear plasmid carriage is evident from 2018 to 2024 across the dominant emerging sequence types (ST80, ST17, ST117). Linezolid resistance based on phenotypic antimicrobial susceptibility testing was low (1.0 %, 21/2071). Using long read sequencing, we characterized the linezolid resistant isolates and confirmed pELF plasmid presence in 13/21 (61.9 %) isolates. Six isolates harboured pELF plasmids encoding linezolid resistance genes (optrA, cfr(D), poxtA) and five of these also encoded vancomycin resistance genes (vanA). We compared these six plasmids to 39 public plasmid sequences and clustered them using MOB-suite and pling. 
Overall, this study provides further examples of the co-carriage of vancomycin and linezolid resistance genes on mobile linear plasmids and shows that linear plasmid prevalence is detectable and increasing across VRE in Canada. IMPACT STATEMENT. Given the increasing prevalence of multidrug-resistant hospital-acquired pathogens, resistance to last-resort antibiotics is a global public health threat. Linezolid is a last-resort antibiotic used to treat vancomycin-resistant Enterococcus isolates, and the dissemination of linezolid resistance genes is significantly facilitated by mobile elements that can transfer between unrelated strains and species. Linezolid resistance genes have recently been described on linear plasmids and are often co-localized with other resistance genes on the same plasmid backbone. Consequently, understanding the features and distribution of linear plasmids and those harbouring linezolid resistance genes is crucial for pathogen surveillance and mitigation of resistance. In this work, we used long-read and short-read sequencing to characterize genomic epidemiology of linear plasmids across 16 years of Enterococcus surveillance data in Canada. This study furthers knowledge of linear plasmids by demonstrating that they are relatively common across vancomycin-resistant Enterococcus blood isolates and by providing more examples of co-localized vancomycin and linezolid resistance genes on the same linear plasmid backbone. DATA SUMMARY. Sequencing data and genome sequences were deposited in National Center for Biotechnology Information (NCBI) BioProject PRJNA1279082, and accessions are listed in Table S1. Supplementary materials for this study are available at the Figshare portal through DOI: XXX.
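
The pELF carriage percentages quoted in this abstract follow directly from the reported isolate counts. A minimal sketch that recomputes them is shown below; the counts are copied from the abstract, while the dictionary layout and the `percent` helper are illustrative choices made here.

```python
# (predicted pELF carriers, isolates tested), as reported in the abstract.
counts = {
    "overall": (941, 2071),
    "2018": (59, 183),
    "2021": (115, 168),
    "2022": (146, 204),
    "2023": (166, 228),
    "2024": (235, 328),
}


def percent(carriers, total):
    """Carriage as a percentage, rounded to one decimal place."""
    return round(100.0 * carriers / total, 1)


percentages = {label: percent(*pair) for label, pair in counts.items()}
for label, value in percentages.items():
    print(f"{label}: {value} %")
# overall: 45.4 %
# 2018: 32.2 %
# 2021: 68.5 %
# 2022: 71.6 %
# 2023: 72.8 %
# 2024: 71.6 %
```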

8

Multi-Algorithm Machine Learning Benchmarking for Pan-Cancer Classification from Tumour-Educated Platelet RNA Sequencing

Ray, S.; Zalawadia, D. H.; Bhate, V.; Chakravarthy, T. D.; Chetty, A. G.

2026-05-26 bioinformatics 10.64898/2026.05.22.727079 medRxiv

Top 0.1%

22.9%

Tumour-educated platelets (TEPs) carry cancer-type-specific RNA signatures accessible through whole-blood RNA sequencing, but systematic multi-algorithm benchmarking with quantified statistical uncertainty had not been applied to the GSE68086 dataset, the field's primary reference cohort. We applied an end-to-end transcriptomic and machine learning framework to 280 whole-blood platelet RNA-seq samples from six cancer types (non-small cell lung cancer, colorectal cancer, glioblastoma multiforme, hepatobiliary cancer, breast cancer, and pancreatic cancer) and healthy donors. After a standardised preprocessing and normalisation pipeline, seven supervised classifiers (Logistic Regression, SVM (RBF), XGBoost, LightGBM, Random Forest, K-Nearest Neighbours, and a Multilayer Perceptron) were benchmarked using stratified 5-fold cross-validation and a held-out test set. Statistical uncertainty was quantified via 2,000-resample percentile bootstrap confidence intervals. Multinomial Logistic Regression achieved the highest test macro F1-score (0.522) and macro-averaged ROC-AUC (0.869), both substantially above the seven-class chance level (1/7 ≈ 0.14). SHAP analysis of the Random Forest classifier identified IFITM3 as the globally dominant TEP biomarker; cancer-type-specific discriminators included ATP5PD (hepatobiliary cancer), C6orf62 (NSCLC and pancreatic cancer), VPS13C (healthy donors), and TMSB4Y (breast cancer). Gene Ontology and KEGG pathway enrichment corroborated the biological specificity of identified transcriptomic signatures. These results support the diagnostic potential of TEP transcriptomics as a multi-class liquid biopsy platform and provide a methodologically transparent, reproducible reference framework for future blood-based cancer classification studies.
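
A percentile bootstrap confidence interval for a macro F1-score, of the kind this abstract mentions, can be sketched with the standard library alone. This is an illustrative sketch rather than the authors' code: the toy labels, the function names, the fixed seed, and the convention of scoring a class as 0 when it has no true or predicted members in a resample are all assumptions made here.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (a class with no support scores 0)."""
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def bootstrap_ci(y_true, y_pred, labels, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample test samples with replacement, rescore, take quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx],
                              [y_pred[i] for i in idx], labels))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy three-class test set (hypothetical labels, not from the study).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
labels = [0, 1, 2]

point = macro_f1(y_true, y_pred, labels)
lower, upper = bootstrap_ci(y_true, y_pred, labels)
chance_seven_class = 1 / 7  # uniform-guess reference quoted in the abstract
print(round(point, 3), round(chance_seven_class, 2))
```

Resampling whole test samples keeps each true label paired with its prediction, so the interval reflects sampling variability in the test set rather than in model training.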

9

CSF-Seq enables transcriptome-wide profiling of cerebrospinal fluid and identifies prognostic signature of leptomeningeal disease

Hayden Gephart, M.; Umeh Garcia, M.; Barisano, G.; Nunez Perez, P.; Trinh, T.; Taiwo, R.; Herrick, D.; Roy-O'Reilly, M.; Lee, S.; Spiliotopoulous, E.; Weixel, C.; Burnside, G.; Godfrey, B.; Zhang, Y.; Chernikova, S.; Tosoni, S.; Granucci, M.; Riviere-Cazaux, C.; Coffey, G.; Villanueva, E.; Burns, T.; Nagpal, S.; Ngo, T.

2026-05-26 cancer biology 10.64898/2026.05.21.725787 medRxiv

Top 0.1%

22.4%

Leptomeningeal disease (LMD) is a rapidly fatal complication of systemic cancer for which sensitive diagnostic tools and informative biomarkers remain limited. Here, we introduce CSF-Seq, a method for whole-transcriptome sequencing of cell-free RNA (cfRNA) from human cerebrospinal fluid (CSF), designed to enable molecular profiling of LMD and other central nervous system (CNS) conditions. Using a prospectively collected CSF biobank, we analyzed 125 samples spanning multiple pathologies, including breast and lung LMD, glioblastoma, traumatic brain injury, and non-cancer neurological controls. Through optimized RNA extraction, library preparation, and deep sequencing, CSF-Seq generated robust and reproducible transcriptome-wide profiles despite the low abundance and fragmentation of cfRNA in CSF. CSF transcriptomes exhibited disease-specific expression, separating LMD from non-cancer controls and from non-LMD cancers, independent of CSF collection modality. Tumor-associated epithelial transcripts, including CEACAM6 and MUC1, were consistently enriched in LMD samples, whereas immune and CNS-associated transcripts were broadly detected across disease states, consistent with contributions from both tumor and non-tumor sources. Cross-site processing of matched samples demonstrated high concordance, indicating preservation of sample-specific transcriptional signatures across independent workflows. Importantly, we identified a collection method-independent LMD gene expression signature that was significantly associated with overall survival, supporting its potential prognostic relevance. Together, these findings establish CSF-Seq as a technically robust and clinically informative platform for transcriptomic biomarker discovery in CNS metastatic disease, offering a minimally invasive approach for disease characterization, risk stratification, and longitudinal monitoring in patients with LMD.

10

Exploratory dried blood spot metabolomics identifies pathway-level convergence with ME/CFS biology in a self-reported PEM-like fatigue phenotype

Hauguel, P.; Anctil, N.; Noel, L.-P.

2026-06-10 rheumatology 10.64898/2026.06.08.26355197 medRxiv

Top 0.1%

22.3%

Background. Plasma and serum metabolomic studies of myalgic encephalomyelitis / chronic fatigue syndrome (ME/CFS) have repeatedly implicated hypometabolic, lipid, mitochondrial, redox and tryptophan-kynurenine pathways, but prior cohorts have been modest in size and have used heterogeneous case definitions. Whether similar pathway-level signals are detectable at scale in dried blood spots (DBS), across questionnaire-derived fatigue constructs and across orthogonal LC gradients in the same individuals remains unresolved. Methods. We profiled DBS extracts from 1,784 community-cohort adults by reverse-phase LC-MS using paired 5 min and 15 min gradients. Six questionnaire-derived endpoints captured a pragmatic self-reported PEM-like phenotype, a DSQ-derived PEM-like construct, high or review clinical status, temporal fatigue state, comorbid fatigue and self-reported chronic fatigue. The locked primary endpoint for Phase 1 was pragmatic_fatigue_pem with 226 cases and 914 controls after excluding major metabolic comorbidity. We tested a biology-first panel comprising 22 literature-curated metabolites represented by four participant-level descriptors each, and evaluated three discovery extensions: a targeted m/z search of additional literature candidates, a hypothesis-free univariate screen across 4,553 5 min and 5,625 15 min consensus features, and pairwise z-difference ratios. Endpoint-specific Ridge classifiers were evaluated by five-fold out-of-fold AUC with bootstrap stability filtering. Cross-gradient agreement was assessed by per-metabolite AUC concordance between paired 5 min and 15 min profiles. Severity was modelled as an ordinal grade derived from the number of fatigue criteria met and chronic-fatigue-form status. Results. The biology-first DBS panel achieved out-of-fold AUC 0.81 for the pragmatic self-reported PEM-like endpoint (226 cases / 914 controls). 
The DSQ-derived PEM-like construct reached AUC 0.60 (57 cases / 201 controls) on the un-filtered set and AUC 0.778 (SD 0.013, twenty seeds) in a post-hoc signature-decomposition follow-up restricted to participants without a self-declared major-metabolic-history tag (29 cases / 230 controls); both are treated as construct-validity anchors rather than as provoked or clinically adjudicated PEM. An optimised operationalisation of the same construct (panel-self normalisation, restriction to non-comorbid participants and demographic covariates) reached AUC 0.71 (95 % CI 0.55 to 0.76), and an exploratory age-stratified signature decomposition suggested age-dependent pathway composition that requires confirmation given small per-stratum case counts. Stable contributors mapped to carnitine-shuttle, TCA-cycle, redox-thiol and tryptophan-kynurenine pathways. Cross-gradient analysis of 22 matched metabolites yielded Pearson r = 0.62 for signed univariate effects (p = 0.002; 68 % directional agreement). The metabolomic score increased with severity grade (Spearman rho = 0.45, p = 4 x 10^-91; median scores 0.24, 0.51 and 0.75 across grades 0, 1 and 2). Sensitivity analyses on the covariate-complete subset (n = 565; 138 cases / 427 controls) showed that the DBS signal was robust to adjustment for age, sex, BMI and medication burden (DBS-only AUC 0.76, DBS plus covariates 0.78, covariates only 0.64), and produced a metabolomic-specific lift of approximately 0.13 AUC over the strongest anti-leak declarative cross-form questionnaire baseline (AUC 0.63). DBS-only AUC was stable across sex, age and BMI subgroups, and a 1:4 nearest-neighbour matched analysis on age, sex and BMI yielded AUC 0.72 (95 % CI 0.67 to 0.77). The observed pattern supported pathway-level convergence with prior ME/CFS metabolomics literature, including carnitine shuttle, fatty-acid beta-oxidation, TCA cycle, redox-thiol, urea cycle, glycerophospholipid and tryptophan-kynurenine axes. 
In contrast, the hypothesis-free 15 min screen produced high-AUC features that mapped predominantly to environmental or technical signals, including pesticide, industrial-amine and mobile-phase artifact annotations; only one of eight top leads, a truncated oxidised phospholipid, was biologically plausible, and none had tandem-MS support. Conclusions. In this large community cohort, a literature-curated DBS metabolomic panel captured pathway-level biology associated with a questionnaire-derived PEM-like fatigue phenotype, showed directional concordance across LC gradients, scaled with symptom severity and remained robust to key demographic, anthropometric and anti-leak questionnaire baselines. The findings converge with several metabolic axes previously reported in ME/CFS plasma and serum studies, including carnitine-shuttle, TCA-cycle, redox-thiol, urea-cycle, glycerophospholipid and tryptophan-kynurenine pathways. They should not be interpreted as clinical validation of a diagnostic test, screening tool or objective provoked-PEM biomarker. Rather, they support at-home-compatible DBS metabolomics as a biologically grounded platform for future clinically adjudicated validation, decision-support development and longitudinal monitoring in fatigue and PEM-like syndromes. Because DBS contains cellular and plasma-derived components, matrix effects must be considered when comparing individual metabolites with venous plasma or serum studies, and hypothesis-free screening at this scale can preferentially surface exposome or technical variance unless molecular identification is enforced before biological interpretation.

11

dbGIST: An LLM-Assisted Multi-Omics Resource for Target Exploration and Cross-Dataset Validation in Gastrointestinal Stromal Tumors

Sun, Z.; Zhao, Q.; Li, J.-H.; Li, J.-J.; Liu, H.; Guo, Y.-X.; Tang, Y.-D.; Yang, F.; Liu, X.; Peng, S.-F.; Mi, W.-n.; Zhang, G.; Zhang, Z.; Yuan, M.-L.; Li, G.-H.; Wang, Y.-F.; Liu, C.; Li, S.-L.; Yang, J.-H.; Fu, Y.

2026-05-26 cancer biology 10.64898/2026.05.22.727292 medRxiv

Top 0.1%

22.0%

Gastrointestinal stromal tumors (GISTs) are the most common mesenchymal neoplasms of the gastrointestinal tract, yet GIST-specific omics evidence remains scattered across small cohorts and is not represented as a dedicated disease project in major cancer genomics resources, limiting reproducible target exploration. Here, we present dbGIST (https://www.dbgist.com), a dedicated GIST-focused multi-omics resource built to make dispersed GIST evidence searchable, analyzable, and reusable. dbGIST harmonizes data from 37 centers and 1,991 samples, including pathologically verified in-house cohorts, across genomics, bulk transcriptomics, proteomics, phosphoproteomics, and single-cell transcriptomics, and couples these data with curated clinical annotations covering survival, mutation status, risk stratification, metastasis or recurrence, mitotic index, tumor site and size, and imatinib response. The platform supports cohort-level molecular-clinical association, survival, enrichment, immune-infiltration, drug-sensitivity, and single-cell analyses through interactive visualizations, downloadable source data, and public APIs for programmatic access to reusable analysis outputs and visualization-ready data. An optional LLM-assisted interface helps users navigate analyses and interpret outputs. Using MCM7 as a case study, dbGIST linked a resource-derived candidate to survival, risk features, metastatic or recurrent disease, imatinib-response phenotypes, proliferative cell states, and in vitro GIST-cell behavior. dbGIST therefore provides a traceable and interoperable resource for target exploration and precision oncology research in GIST.

12

A multi-omic, spatial, and whole-slide image dataset of lung neuroendocrine tumours from the lungNENomics cohort

Kalson, L.; Sexton-Oates, A.; Mathian, E.; Voegele, C.; Di Genova, A.; Li, Z.; Kim, J.; Marsh, L. M.; Brcic, L.; Fernandez-Cuesta, L.; Foll, M.; Alcala, N.

2026-05-14 bioinformatics 10.64898/2026.05.12.724489 medRxiv

Top 0.1%

19.4%

Lung neuroendocrine tumours (lung NETs) are rare neoplasms comprising approximately 2% of lung cancers. Recent studies have identified distinct molecular groups based on transcriptome and methylome data, but genomic and morphological features remain underexplored due to limited whole-genome and imaging data. We have generated the largest multi-omic dataset of lung NETs to date (201 participants, for a total of n = 294 tumours), including RNA sequencing, EPIC 850K methylation arrays, and whole-genome sequencing. This multi-omic dataset also includes multi-regional whole-genome sequencing for 41 participants, allowing for the quantification of intra-tumoural heterogeneity. We additionally generated spatial proteomics (64 participants), spatial transcriptomics (4 participants) and whole-slide histopathology images for 212 cases. This dataset enables a comprehensive characterization of lung NET molecular groups and the identification of group-specific morphological features using deep learning algorithms. All quality control analyses, processed data, and scripts are provided to ensure reproducibility. This dataset is available as a basis for further molecular and morphological analysis of lung NETs, and for future research on multi-scale integration.

13

LVV SMRTcap reveals extensive proviral variation in lentiviral vector-transduced CAR T cells

Kaiser, C.; Sadri, G.; Elliott, E. M.; Mroczkowska, J. J.; Ankita, J.; Ferguson, M.; Bushman, F.; Fraietta, J. A.; Rouchka, E. C.; Smith, M.

2026-05-15 cancer biology 10.64898/2026.05.13.724601 medRxiv

Top 0.1%

18.5%

Lentiviral vectors are commonly used to introduce chimeric antigen receptor transgenes into T cells, but routine assays quantify vector copy number or integration sites without sequencing full-length integrated vectors. HIV-1 proviruses often acquire large deletions and cytidine deaminase-driven hypermutation; whether similar variation occurs in therapeutic lentiviral vectors is unclear. We adapted a novel long-read capture approach to enrich long fragments spanning vector DNA and adjacent human sequence, enabling simultaneous integration-site mapping and proviral integrity analysis with single-molecule resolution. In research-grade CAR T cells produced with an experimental, transient-transfection lentiviral vector workflow, 40% of integrated vectors carried recurrent deletions that removed the internal promoter or parts of the chimeric antigen receptor cassette. The dominant promoter deletion was present in the viral stock. In clinical chimeric antigen receptor T cell products, promoter deletions were less frequent, but detectable pre-infusion and post-infusion. Across datasets we observed widespread G-to-A substitutions consistent with restriction factor editing, including changes predicted to introduce premature stop codons within the transgene open reading frame. Our method reveals proviral variants invisible to standard quality-control assays and provides a framework to improve vector production and monitor transgene integrity in clinical products.
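
Why G-to-A substitutions can introduce premature stop codons follows from the standard genetic code, and a short enumeration makes the point concrete. The sketch below considers only single G-to-A changes read on the coding strand; the abstract does not say which codons were affected in the study, and the example open reading frame is a hypothetical toy sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def g_to_a_stop_codons():
    """Map each sense codon to the stop codons reachable by one G-to-A change."""
    hits = {}
    for codon in (a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"):
        if codon in STOP_CODONS:
            continue  # only sense codons can gain a premature stop
        for i, base in enumerate(codon):
            if base != "G":
                continue
            mutated = codon[:i] + "A" + codon[i + 1:]
            if mutated in STOP_CODONS:
                hits.setdefault(codon, set()).add(mutated)
    return hits


def vulnerable_codon_indices(orf):
    """Zero-based codon positions in an in-frame ORF that one G-to-A change could turn into a stop."""
    vulnerable = g_to_a_stop_codons()
    return [i // 3 for i in range(0, len(orf) - 2, 3) if orf[i:i + 3] in vulnerable]


vulnerable = g_to_a_stop_codons()
print(vulnerable)

toy_orf = "ATGTGGAAATGGTAA"  # ATG TGG AAA TGG TAA (hypothetical sequence)
print(vulnerable_codon_indices(toy_orf))  # [1, 3]
```

On the coding strand, the only sense codon that a single G-to-A change converts to a stop is the tryptophan codon TGG (to TAG or TGA), so tryptophan positions in a transgene are the natural places to inspect for this class of inactivating edit.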

14

Automated Versus Manual Reanalysis In Rare Disease Genomics

Kaschta, D.; Arriens, V.; Mueller, S.; Utermann-Thuesing, C.; Vater, I.; Caliebe, A.; Nagel, I.; Spielmann, M.

2026-05-19 genetic and genomic medicine 10.64898/2026.05.16.26352295 medRxiv

Top 0.1%

18.3%

Purpose. Periodic reanalysis of genome sequencing data can yield additional diagnoses as knowledge evolves, yet manual reanalysis is labour-intensive. We compared automated and manual reanalysis approaches in rare disease genomics. Methods. We reanalyzed 377 rare disease cases: 158 with pathogenic or likely pathogenic (P/LP) findings, 49 with variants of uncertain significance (VUS) findings, and 170 with no findings. Manual reanalysis used the standard diagnostic workflow for all cases without prior P/LP diagnoses (219 cases). An automated pipeline using Talos was benchmarked on the 158 P/LP cases before application to the 219-case reanalysis cohort. The mean reanalysis interval was 660 days. Results. Manual reanalysis identified three additional P/LP cases and two newly classified as VUS, increasing P/LP cases from 158 (41.9%) to 161 (42.7%). Talos recovered all three P/LP findings but only identified one of the two new VUS findings. Benchmarking showed 80.0% singleton concordance and 75.2% (82.8% proband-only) trio concordance, with approximately three variants per case. Conclusion. Reanalysis at 1.8 years yields a modest but clinically meaningful gain. Automated reanalysis closely approximates manual performance while reducing hands-on effort, supporting scalable reanalysis in routine genomic care. Keywords: rare disease genomics, genome sequencing, automated reanalysis, variant prioritization, Talos, diagnostic yield
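
The yield figures in this abstract can be checked from the reported case counts. The short sketch below recomputes them; the numbers are taken from the abstract, the variable names are illustrative, and the conversion of the interval uses an assumed 365.25-day year.

```python
# Case counts as reported in the abstract.
plp_cases = 158        # pathogenic / likely pathogenic before reanalysis
vus_cases = 49         # variants of uncertain significance
negative_cases = 170   # no findings
new_plp = 3            # additional P/LP cases found on manual reanalysis

total_cases = plp_cases + vus_cases + negative_cases    # 377
reanalysis_cohort = vus_cases + negative_cases          # 219 cases without prior P/LP

yield_before = 100 * plp_cases / total_cases
yield_after = 100 * (plp_cases + new_plp) / total_cases
interval_years = 660 / 365.25  # mean reanalysis interval, assumed year length

print(total_cases, reanalysis_cohort)                # 377 219
print(round(yield_before, 1), round(yield_after, 1)) # 41.9 42.7
print(round(interval_years, 1))                      # 1.8
```

The absolute gain is therefore under one percentage point, which matches the abstract's description of a modest improvement.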

15

Vision-Based Genomic Model for Copy Number Variant Pathogenicity Prediction

Buralkin, I.; Botas, J.; Chang, K.-L.; Deng, Y.; Papastathopoulos-Katsaros, A.; Liu, Z.; Park, J.

2026-05-26 bioinformatics 10.64898/2026.05.21.726953 medRxiv

Top 0.2%

18.0%

Copy number variants (CNVs) are a major class of structural genomic alterations underlying rare disease, including neurodevelopmental delay and intellectual disability, yet predicting their pathogenicity remains challenging. Existing methods reduce CNVs to region-level numerical features, discarding the positional structure and cross-track patterns that expert clinical reviewers use to interpret genomic evidence. To address this, we introduce TESSERACT for CNV, a track-based spatial representation for CNV pathogenicity prediction, which represents each variant as a base-pair-resolution multi-track image and models spatial genomic patterns across annotation tracks while preserving positional structure and cross-track dependencies. Trained on a chromosome-level hold-out split of the ClinVar dataset, TESSERACT outperforms prior methods on held-out and curated noncoding benchmarks, improving AUROC by up to 0.10 over the state-of-the-art baseline. On the independent DECIPHER cohort, the model demonstrates generalizability by maintaining the highest AUROC and the highest F1 score across baselines. Furthermore, our model localizes pathogenic signals to clinically meaningful genomic subregions, providing track-annotated evidence that supports practical clinical interpretation.

16

Translational bioinformatics and machine learning framework for biomarker discovery, disease prediction, and patient profiling for precision medicine

Ahmed, Z.; Govindareddy, P.; DeGroat, W.; Narayanan, R.; Peker, E.; Zeeshan, S.

2026-05-27 genetic and genomic medicine 10.64898/2026.05.23.26353961 medRxiv

Top 0.2%

17.2%

Show abstract

Precision medicine aims to move healthcare from a "one-size-fits-all" approach to personalized and predictive care across diverse populations. It promotes the integration of multi-omics and phenotypic data to understand disease mechanisms and discover novel biomarkers and risk factors, which could be used to predict and prevent critical diseases in individual patients. A precision medicine approach can accelerate our ability to classify patients at higher risk of developing critical diseases, improve diagnostic capabilities, develop a deeper understanding of individual risk, investigate racial differences and demographic characteristics, and find relationships between genetic variants, gene expression, and diseases. This study focuses on implementing an innovative, data-driven framework of translational bioinformatics and Machine Learning (ML) techniques to analyze multi-omics data, including RNA-seq and Whole-Genome Sequencing (WGS) data, generated using blood samples of randomly consented patients. First, we used bioinformatics pipelines to identify differentially expressed genes and their pathogenic and likely pathogenic variants for downstream data analysis, annotation, and visualization. Then, we applied a nexus of ML models for multi-omics biomarker discovery, disease prediction, density-based clustering, single-patient profiling, and pathogenicity classification. WGS data analysis supported the exploration of genetic variation and diversity among patients to identify known and novel biomarkers, whereas RNA-seq data analysis improved our understanding of the functional and biological pathways underlying disease states. We classified and clustered pathogenic variants and expression across various genes and discovered numerous leading risk factors for disease. Our results include gene-disease associations and common pathways captured across the broader population, demonstrating a level of sensitivity and accuracy that has broad clinical implications. We validated our results against clinical records and the state-of-the-science literature. This study delves into the strengths of multi-omics data integration and the capabilities of ML in genetically diverse and complex patient cohorts. Our approach has the potential to elucidate complex gene-disease interactions for genetically diverse populations, which can support earlier diagnoses for patients in many disease realms.

17

cfMIND: A read-level methylation framework for accurate non-invasive disease detection using cell-free DNA

Li, J.; Liu, Z.; Zhang, H.; Zhang, Y.; Li, W.; Li, Y.

2026-05-16 bioinformatics 10.64898/2026.05.13.725033 medRxiv

Top 0.2%

17.2%

Show abstract

Plasma cell-free DNA (cfDNA) has emerged as a promising non-invasive biomarker for cancers. However, reliable detection remains challenging due to the low abundance of tumor-derived cfDNA fragments and the dilution of informative methylation signals when aggregated into region-level features. Here, we propose cfMIND, an efficient and robust machine-learning framework that leverages stratified read-level methylation signals to preserve rare cell-type-specific information and enhance detection sensitivity. By avoiding the information loss inherent to conventional aggregation strategies, cfMIND enables more accurate and stable inference across diverse conditions. cfMIND is compatible with various cfDNA methylation sequencing technologies and cancer types. Across multiple cancer datasets (n = 868), cfMIND achieves high performance (AUROC up to 0.966) and maintains strong accuracy even at ultra-low sequencing depth (0.2x) and in early-stage cancers. Notably, cfMIND demonstrates exceptional robustness, generalizing effectively across cohorts and platforms without the need for model retraining. These results highlight its potential utility in heterogeneous experimental and clinically relevant settings. Beyond cancer detection, cfMIND is readily extendable to non-malignant diseases, as demonstrated by its ability to capture disease-associated methylation alterations in amyotrophic lateral sclerosis (ALS). Functional investigations of cfMIND-identified features further reveal enrichment in key regulatory regions implicated in disease pathogenesis and recapitulate tissue- and single-cell-level methylation and transcriptional programs underlying tumor biology. Collectively, cfMIND represents a significant advancement in the field, offering a broadly applicable, functionally interpretable, and high-resolution framework for non-invasive disease detection.

18

Integrative prioritization of clinically and biologically relevant long noncoding RNAs across gastrointestinal cancers

Flowers, B.; Lialios, P.; DiLollo, I.; Smith, N.; Whalley, J.; Lee, J.-S.

2026-05-29 cancer biology 10.64898/2026.05.26.728026 medRxiv

Top 0.2%

17.0%

Show abstract

Across gastrointestinal (GI) cancers, shared malignant programs are layered onto strong anatomical, lineage, and microenvironmental variation, making it difficult to distinguish disease-relevant long noncoding RNAs (lncRNAs) from context-dependent transcriptional signals. We developed a pan-GI integrative framework to classify lncRNAs across colorectal adenocarcinoma, gastric adenocarcinoma, and esophageal cancer using bulk and single-cell transcriptomic resources. This framework evaluates lncRNAs across four complementary dimensions: recurrent tumor-associated expression, clinical association with disease progression and overall survival, co-expression network context, and malignant epithelial expression at single-cell resolution. Paired tumor-normal RNA-seq analyses identified extensive tumor-associated lncRNA dysregulation and defined recurrent pan-GI lncRNAs consistently upregulated across cancer types. Clinical analyses further nominated transcripts linked to tumor extension, nodal involvement, metastatic dissemination, progression-linked expression, and adverse overall survival. Co-expression network analysis identified lncRNAs embedded within disease-associated transcriptional modules, providing functional context for otherwise poorly annotated transcripts. In parallel, single-cell-derived metacell analysis nominated malignant epithelial-associated and detection-supported lncRNAs, helping distinguish tumor-compartment-associated signals from stromal, immune, endothelial, and other microenvironmental contributions. Together, this study establishes an evidence-structured pan-GI lncRNA resource and a generalizable prioritization strategy for nominating disease-associated noncoding transcripts. More broadly, the framework provides a transferable strategy for systematic lncRNA prioritization across other cancers and heterogeneous disease contexts.

19

AFQuery: a bitmap-indexed, capture-aware allele frequency engine for clinical genomics cohorts

Santos-Diaz, G.; Toro-Barrios, N.; Carmona, R.; Uria-Regojo, G.; Jimenez-Arias, R.; Gurriaran, X.; Ramilo, P.; Amigo, J.; Minguez, P.; Dopazo, J.; Lopez-Lopez, D.

2026-05-22 genetic and genomic medicine 10.64898/2026.05.15.26353174 medRxiv

Top 0.2%

14.7%

Show abstract

Motivation: Allele frequency (AF) is central to clinical variant classification under ACMG/AMP guidelines. Public reference databases offer broad ancestry coverage, but local ancestries, rare-disease enrichment, and institutional case distributions are often underrepresented, so cohort-derived AF is a valuable complement. Computing accurate AF from institutional cohorts is nonetheless error-prone: even successive versions of the same capture kit cover substantially different target regions, and naive methods inflate the allele number (AN) at positions not shared by all kits, deflating AF and biasing ACMG frequency evidence toward pathogenic categories. Results: We present AFQuery, a bitmap-indexed AF engine that computes capture-aware, ploidy-aware allele frequencies from pre-indexed Roaring Bitmaps in ≈14 ms per point query (≈34 ms for 1-Mbp region queries), independently of cohort size up to 50,000 samples. In simulated mixed-technology cohorts, capture-aware AN reduced AF mean absolute error 8-13-fold and removed the systematic bias toward pathogenic ACMG categories, yielding 10-45-fold fewer spurious pathogenic-evidence calls. Availability: AFQuery is freely available under the MIT licence at https://github.com/babelomics/afquery.
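
The AN inflation described in this abstract can be illustrated with a small worked example. This is a minimal sketch and not AFQuery's implementation: plain Python integers stand in for Roaring Bitmaps, every sample is assumed diploid, and the function names and toy cohort are hypothetical.

```python
# Toy illustration of capture-aware allele number (AN); not AFQuery's code.
# Python ints serve as bitmaps: bit i set means sample i is in the set.

def popcount(bitmap: int) -> int:
    """Number of set bits, i.e. number of samples in the bitmap."""
    return bin(bitmap).count("1")

def allele_frequency(het: int, hom: int, eligible: int, ploidy: int = 2) -> float:
    """AF = AC / AN, counting only samples in the `eligible` bitmap."""
    ac = popcount(het & eligible) + ploidy * popcount(hom & eligible)
    an = ploidy * popcount(eligible)
    return ac / an if an else float("nan")

# Hypothetical cohort of 10 samples; samples 0-3 were sequenced with kit A.
all_samples = (1 << 10) - 1
kit_a = 0b0000001111

# Assume the queried position is targeted by kit A only.
covered = kit_a

# Carriers observed at the position: sample 0 heterozygous, sample 1 homozygous.
het = 1 << 0
hom = 1 << 1

naive_af = allele_frequency(het, hom, all_samples)   # AN = 20, AF = 3/20
aware_af = allele_frequency(het, hom, covered)       # AN = 8,  AF = 3/8
print(f"naive AF = {naive_af:.3f}, capture-aware AF = {aware_af:.3f}")
```

In this toy cohort the naive denominator counts six samples that were never assayed at the position, so the estimated AF falls from 0.375 to 0.150, which is the direction of bias the abstract describes.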

20

RANKOR: Direct Drug Prioritization from Bulk and Single-Cell Transcriptomic Signatures

Katsaouni, N.; Schulz, M. H.

2026-05-21 bioinformatics 10.64898/2026.05.20.726471 medRxiv

Top 0.2%

14.6%

Show abstract

Background: Prioritizing therapeutics from transcriptomic data remains a key challenge in precision medicine. Signature reversal approaches, most commonly implemented through Gene Set Enrichment Analysis (GSEA), have been widely used to match disease signatures to candidate drugs. However, enrichment-based methods can be sensitive to noise and are restricted to previously profiled compounds. Methods: We developed RANKOR, a machine-learning framework designed to rank candidate drugs directly from transcriptomic signatures. Rather than predicting full expression profiles, RANKOR learns structured latent representations of transcriptional responses alongside chemical structure, enabling prioritization from standardized signatures derived from disease states or treatment perturbations. The framework is applicable to both bulk and single-cell transcriptomic data. Results: Across large-scale perturbational datasets, RANKOR achieved consistently lower median ranks than similarity- and distance-based approaches, while showing performance comparable to, and in some settings improved over, GSEA. The model generalized across unseen cell types and retained performance in single-cell settings, where it provided more consistent prioritization than existing approaches, such as ASGARD. RANKOR further enabled prioritization of transcriptionally unseen compounds through chemical-space embedding and achieved substantially reduced computation times. Robustness analyses demonstrated stable performance under moderate noise and degradation under extreme perturbation or gene shuffling. Gene attribution analyses indicated that prioritization decisions are driven by coherent and mechanism-relevant transcriptional programs. Conclusions: RANKOR provides a scalable framework for transcriptomics-guided drug prioritization that can complement and extend existing approaches, such as GSEA. It can also support therapeutic hypothesis generation from bulk and single-cell data while leveraging the generalisability and computational efficiency of machine learning models.
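
For readers unfamiliar with signature reversal, the idea can be sketched with a simple similarity-based ranking. The example below is neither RANKOR nor GSEA: it assumes cosine similarity as the score, and the gene values and drug names are hypothetical. A drug whose signature is most strongly anti-correlated with the disease signature is ranked first.

```python
# Toy signature-reversal ranking by cosine similarity; not RANKOR or GSEA.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length signatures."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_reversal(disease, drug_signatures):
    """Sort drug names so the most anti-correlated signature comes first."""
    return sorted(drug_signatures,
                  key=lambda name: cosine(disease, drug_signatures[name]))

# Hypothetical log fold changes over the same four genes, in the same order.
disease = [2.0, 1.5, -1.0, -2.0]
drugs = {
    "drug_mimic":    [1.8, 1.2, -0.9, -2.1],   # resembles the disease signature
    "drug_neutral":  [0.1, -0.1, 0.1, -0.1],   # little relationship
    "drug_reverser": [-2.0, -1.4, 1.1, 1.9],   # opposes the disease signature
}

ranking = rank_by_reversal(disease, drugs)
print(ranking)   # ['drug_reverser', 'drug_neutral', 'drug_mimic']
```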